Module 02 - Python Observability

What You Will Learn

By the end of this module you will be able to:

  • Explain the three pillars of observability and why each one is irreplaceable
  • Configure production-grade structured logging with structlog and correlation IDs
  • Expose Prometheus metrics from a FastAPI service and write real PromQL queries
  • Instrument a Python microservice with OpenTelemetry and visualise traces in Jaeger
  • Capture, group, and alert on production exceptions with Sentry
  • Build health check endpoints that Kubernetes actually trusts

Prerequisites

| Requirement | Why It Matters |
|---|---|
| Python 3.11+ | contextvars, asyncio, type hints used throughout |
| FastAPI basics | All production examples use FastAPI |
| Docker + docker-compose | Every tool in this module runs locally via compose |
| Module 1 complete | Async patterns and profiling context assumed |
| Basic SQL / PostgreSQL | Incident examples reference pg_stat_activity |

The Incident That Starts Every Observability Story

It is 14:23 on a Tuesday. Requests per second on your Python API are normal. HTTP 200s are flowing. No exceptions in Sentry. No alerts in PagerDuty. But your product manager has just forwarded a screenshot from a paying customer: every action in the app takes 8–12 seconds instead of the usual 400ms.

You open your logs:

INFO:uvicorn.access: 200 POST /api/documents 11432ms
INFO:uvicorn.access: 200 POST /api/documents 9871ms
INFO:uvicorn.access: 200 POST /api/documents 12103ms

The service is returning 200 OK. Latency is terrible. Logs show nothing useful - no query, no user, no context. You have no metrics so you cannot see when it started or whether it is getting worse. You have no traces so you cannot see where the 11 seconds are actually going.

Four hours later, after crawling through application code and guessing, someone runs this on the database host:

SELECT count(*), state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE datname = 'myapp_prod'
GROUP BY state, wait_event_type, wait_event
ORDER BY count DESC;

 count | state  | wait_event_type | wait_event
-------+--------+-----------------+------------
    48 | active | Lock            | relation
     2 | idle   |                 |
     0 | ...

Connection pool exhaustion. The sessions stuck waiting on a relation lock were holding their connections, and the application pool was set to 10 connections. Under load, new requests queued for a free connection, each waiting up to the 30-second timeout. The service never returned an error because the requests eventually succeeded - just 11 seconds late.

Four hours of debugging for a problem that a single Prometheus gauge would have surfaced in four seconds.
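The queueing behaviour behind this incident can be reproduced with a toy pool. The sketch below is a simulation under stated assumptions, not the service's real pool: `TrackedPool`, `simulate`, and the sizes and timings are hypothetical, and `asyncio.sleep` stands in for a database query. The `in_use` and `waiting` attributes are the kind of numbers a gauge would export.

```python
import asyncio
import time


class TrackedPool:
    """Toy connection pool (hypothetical) that counts borrowers and waiters."""

    def __init__(self, size: int) -> None:
        self._slots = asyncio.Semaphore(size)
        self.in_use = 0        # what a "connections in use" gauge would report
        self.waiting = 0       # what a "callers waiting" gauge would report
        self.peak_waiting = 0

    async def run(self, work_seconds: float) -> float:
        """Borrow a slot, simulate a query, return total latency in seconds."""
        started = time.perf_counter()
        self.waiting += 1
        self.peak_waiting = max(self.peak_waiting, self.waiting)
        await self._slots.acquire()   # blocks only when the pool is exhausted
        self.waiting -= 1
        self.in_use += 1
        try:
            await asyncio.sleep(work_seconds)  # stand-in for the real query
        finally:
            self.in_use -= 1
            self._slots.release()
        return time.perf_counter() - started


async def simulate(pool_size: int, requests: int, work_seconds: float):
    """Fire all requests at once and collect their latencies."""
    pool = TrackedPool(pool_size)
    latencies = await asyncio.gather(
        *(pool.run(work_seconds) for _ in range(requests))
    )
    return list(latencies), pool.peak_waiting


latencies, peak_waiting = asyncio.run(
    simulate(pool_size=2, requests=8, work_seconds=0.05)
)
print(f"fastest={min(latencies):.3f}s slowest={max(latencies):.3f}s "
      f"peak_waiting={peak_waiting}")
```

With 2 slots and 8 concurrent requests, every request succeeds, yet the last ones take roughly four times as long as the first. A gauge on `waiting` makes that visible directly, while an access log only shows slow 200s.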

This module is about never having that four-hour incident again.

Why Observability is Not Just Logging

Most engineers learn to "add logging" and consider observability done. That mental model breaks in production. Here is why the three pillars are each irreplaceable:

┌─────────────────────────────────────────────────────────────────┐
│                       OBSERVABILITY STACK                       │
├─────────────────┬──────────────────┬────────────────────────────┤
│  LOGS           │  METRICS         │  TRACES                    │
│                 │                  │                            │
│  "What happened │  "How much /     │  "Where did the time go?"  │
│   and when?"    │   how often?"    │                            │
│                 │                  │                            │
│  Discrete       │  Aggregated      │  Causal chain across       │
│  events with    │  numerical       │  services, showing         │
│  context        │  measurements    │  parent-child timing       │
│                 │  over time       │                            │
│  structlog      │  Prometheus      │  OpenTelemetry + Jaeger    │
│  Loki           │  Grafana         │                            │
│  Datadog Logs   │  Alertmanager    │                            │
├─────────────────┴──────────────────┴────────────────────────────┤
│                         ERROR TRACKING                          │
│           "Which exceptions are happening, how often,           │
│            and what is their full context?"                     │
│                       Sentry / GlitchTip                        │
├─────────────────────────────────────────────────────────────────┤
│                          HEALTH CHECKS                          │
│      "Is this service safe to receive traffic right now?"       │
│               /liveness    /readiness    /startup               │
└─────────────────────────────────────────────────────────────────┘

Logs: What Happened

A log is a discrete event record. It has a timestamp, a severity level, a message, and ideally a rich set of structured key-value context fields. Logs answer questions like:

  • "Which user triggered this error?"
  • "What SQL query ran before the exception?"
  • "What was the document_id being processed when the worker died?"

Logs are high cardinality - you can store as much context per event as you need. They are bad at answering "how often does this happen?" because that requires reading and counting many log lines.

Metrics: How Much / How Often

A metric is a numerical measurement aggregated over time. It answers:

  • "How many requests per second are we serving right now?"
  • "What is the p99 latency of the /api/classify endpoint?"
  • "How many database connections are currently in use?"

Metrics are low cardinality - every distinct combination of label values becomes its own time series, so putting per-user data in a Prometheus label multiplies the series count. They are great for alerting because they are pre-aggregated and cheap to query.
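The cardinality cost is simple multiplication. As a rough worked example (the label names and value counts here are hypothetical), the upper bound on series for one metric is the product of the distinct values of each label:

```python
from math import prod


def series_count(label_values: dict[str, int]) -> int:
    """Upper bound on time series for one metric.

    Each key is a label name, each value the number of distinct values
    that label can take (hypothetical figures for illustration).
    """
    return prod(label_values.values())


bounded = series_count({"method": 5, "endpoint": 20, "status_class": 5})
with_user = series_count(
    {"method": 5, "endpoint": 20, "status_class": 5, "user_id": 100_000}
)

print(bounded)    # 500
print(with_user)  # 50000000
```

Adding a single `user_id` label turns 500 series into 50 million, which is why per-user detail belongs in logs or span attributes rather than metric labels.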

Traces: Where Did the Time Go

A trace is a causal chain of timed operations across a distributed system. A single user request might touch an API gateway, two microservices, a database, Redis, and an external LLM API. A trace shows you:

  • The exact wall-clock time each service spent on the request
  • Which service was the bottleneck
  • The gaps between services (network, queues, serialisation)
  • Whether a slow downstream dependency caused a cascade

Traces answer the question that logs and metrics cannot: "The request took 800ms total - where did that time go?"
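To make the parent-child timing idea concrete, here is a minimal standard-library sketch. It is not the OpenTelemetry API: `ToyTracer` and `Span` are hypothetical names, the sleeps stand in for real work, and the single shared stack means it only works for synchronous, single-threaded code.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Span:
    name: str
    parent: Optional[str]
    start: float
    end: float = 0.0

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000


@dataclass
class ToyTracer:
    """Records finished spans and tracks the currently active span."""

    finished: list[Span] = field(default_factory=list)
    _stack: list[Span] = field(default_factory=list)

    @contextmanager
    def span(self, name: str) -> Iterator[Span]:
        # The span on top of the stack becomes the parent of the new span.
        parent = self._stack[-1].name if self._stack else None
        current = Span(name=name, parent=parent, start=time.perf_counter())
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.perf_counter()
            self._stack.pop()
            self.finished.append(current)


tracer = ToyTracer()
with tracer.span("POST /api/documents"):
    with tracer.span("db.query"):
        time.sleep(0.03)    # stand-in for a slow query
    with tracer.span("redis.get"):
        time.sleep(0.005)   # stand-in for a cache lookup

for s in tracer.finished:
    print(f"{s.name:<22} parent={s.parent} {s.duration_ms:.1f}ms")
```

Each span records its own duration and its parent, so the root span's total can be broken down into the operations nested inside it. A real tracing SDK adds trace IDs, propagation across process boundaries, and export to a backend.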

The Mistake: Thinking One Pillar Is Enough

| Scenario | Logs alone | Metrics alone | Traces alone |
|---|---|---|---|
| High p99 latency | See individual slow requests but no pattern | See the spike but not why or which service | See the bottleneck if you have a trace |
| Exception spike | See exceptions with context | See the rate spike but no context | Traces show span errors but not exception details |
| Connection pool exhaustion | See timeout errors but not pool state | See the gauge and alert immediately | Not directly visible without custom spans |
| Which user was affected | Yes, if logged | No - metrics are aggregated | Yes, if user ID in span attributes |

You need all three. They are complementary, not redundant.

The print() Problem

Every Python developer starts with print(). Here is what is wrong with it in production:

# What most beginners write
print(f"Processing document {doc_id}")
print(f"Error: {e}")

# What production requires
import structlog

log = structlog.get_logger()

log.info(
    "document.processing.started",
    document_id=doc_id,
    user_id=current_user.id,
    file_size_bytes=doc.size,
    content_type=doc.content_type,
)

The difference is not cosmetic. With print():

  • There is no timestamp (or it is not machine-parseable)
  • There is no severity level - you cannot filter for errors only
  • There is no structured data - you cannot query document_id = "abc123" in Kibana
  • It writes synchronously to stdout with no buffering control - if stdout is a slow pipe or terminal, each write can stall your event loop under load
  • You cannot route it to different destinations (file, syslog, log aggregator)
  • You cannot suppress it in tests without redirecting stdout

With structured logging, a log line becomes a queryable document:

{
  "timestamp": "2026-03-07T14:23:01.234Z",
  "level": "info",
  "event": "document.processing.started",
  "document_id": "doc_8f3a2c",
  "user_id": "usr_99f1b4",
  "file_size_bytes": 204800,
  "content_type": "application/pdf",
  "service": "document-api",
  "version": "2.14.0",
  "environment": "production",
  "request_id": "req_7e9d3b"
}

That single line can be searched, aggregated, alerted on, and correlated with traces - automatically.
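To show the mechanics without any third-party dependency, here is a minimal sketch using only the standard library `logging` module and `contextvars`. It is a stand-in for illustration, not structlog's API: `JsonFormatter`, `build_logger`, `request_id_var`, and the `extra={"context": ...}` convention are all assumptions made for this example.

```python
import contextvars
import io
import json
import logging
from datetime import datetime, timezone

# Hypothetical context variable holding the current request's correlation ID.
request_id_var: contextvars.ContextVar[str] = contextvars.ContextVar(
    "request_id", default="-"
)


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
            "request_id": request_id_var.get(),
        }
        # Fields passed via extra={"context": {...}} become top-level keys.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(stream: io.TextIOBase) -> logging.Logger:
    """Attach the JSON formatter to a dedicated logger writing to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("document-api")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
request_id_var.set("req_7e9d3b")  # normally set once per request by middleware
log.info(
    "document.processing.started",
    extra={"context": {"document_id": "doc_8f3a2c", "file_size_bytes": 204800}},
)
print(buffer.getvalue().strip())
```

The correlation ID is read from the context variable at format time, so every log line emitted while handling a request carries the same `request_id` without passing it through each function call.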

A Metric Is Not a Log

A common mistake is trying to use logs as metrics:

# Wrong: trying to use a log query as a metric
log.info("cache_miss", key=cache_key)
# Then querying: count(event="cache_miss") per minute in Kibana

This works at small scale. At production scale:

  • Log ingestion has latency - your "metric" can lag 30–60 seconds behind reality
  • Log storage is expensive - you are paying per GB for numerical data
  • Log queries are slow - counting matches often means scanning every log line in the time range
  • Log fields are unbounded - one bad log statement that promotes a UUID to an indexed label (for example a Loki stream label) creates a new stream for every value

The right solution: a Prometheus counter.

from prometheus_client import Counter

cache_misses = Counter(
    "cache_misses_total",
    "Total cache misses",
    ["cache_name", "operation"],
)

# In your cache layer:
cache_misses.labels(cache_name="document_cache", operation="get").inc()

Now rate(cache_misses_total[5m]) in PromQL gives you the per-second cache miss rate, averaged over the last five minutes, with no log parsing, lag bounded by the scrape interval, and negligible storage.
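The arithmetic behind a counter rate can be sketched in a few lines. This is a simplified model, not Prometheus's implementation: the real rate() also extrapolates to the edges of the window, which is omitted here. `simple_rate` and the sample values are hypothetical.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase across (timestamp_seconds, counter_value) samples.

    A drop in value is treated as a counter reset (for example a process
    restart), so the post-reset value is counted as new increase.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Three scrapes over a five-minute window: 300 misses in 300 seconds.
print(simple_rate([(0, 100), (150, 250), (300, 400)]))  # 1.0

# The process restarted before the last scrape, so the counter fell to 30.
print(simple_rate([(0, 100), (150, 250), (300, 30)]))   # 0.6
```

Because a counter only ever goes up between restarts, a decrease can be read unambiguously as a reset. That is what makes counters safe to rate over, and why a value that can go down should be a gauge instead.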

A Trace Is Not a Metric

Another common mistake:

# Wrong: using a histogram to find out why a service is slow
request_latency.labels(service="downstream-api").observe(latency)
# This tells you the downstream API is slow
# But it does NOT show you why - is it the network? The DB? A specific query?

A Prometheus histogram tells you that the downstream API is slow. A distributed trace tells you why - it shows you every operation inside that service with its individual timing, the exact SQL queries that ran, the Redis lookups that happened, and the outbound HTTP calls that were made.
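A simplified model of histogram storage shows why the "why" is lost. The sketch below mimics cumulative `le` buckets in plain Python. The bucket bounds and latencies are hypothetical, and this is an illustration of the data model rather than the prometheus_client implementation.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in seconds; the last one catches everything.
BUCKETS = [0.1, 0.5, 1.0, 5.0, float("inf")]


def observe_all(latencies: list[float]) -> dict[float, int]:
    """Return cumulative observation counts keyed by bucket upper bound."""
    counts = [0] * len(BUCKETS)
    for value in latencies:
        # Index of the first bucket whose upper bound is >= value.
        counts[bisect_left(BUCKETS, value)] += 1
    cumulative: dict[float, int] = {}
    running = 0
    for bound, count in zip(BUCKETS, counts):
        running += count
        cumulative[bound] = running
    return cumulative


buckets = observe_all([0.04, 0.3, 0.4, 0.9, 11.4])
print(buckets)  # {0.1: 1, 0.5: 3, 1.0: 4, 5.0: 4, inf: 5}
```

The 11.4-second request survives only as "one observation above 5 seconds". Which request it was, and what it spent its time on, is gone by design.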

Use metrics for alerting. Use traces for root cause analysis.

The Observability Stack Used in This Module

All tools in this module are open source and run locally with docker-compose:

| Tool | Role | Port |
|---|---|---|
| structlog | Structured logging library | (library) |
| Loki | Log aggregation and storage | 3100 |
| Promtail | Log shipper (files → Loki) | 9080 |
| Prometheus | Metrics scraping and storage | 9090 |
| Alertmanager | Alert routing and deduplication | 9093 |
| Grafana | Metrics and log dashboards | 3000 |
| OpenTelemetry Collector | Trace collection and routing | 4317/4318 |
| Jaeger | Distributed trace storage and UI | 16686 |
| Sentry (self-hosted) | Error tracking | 9000 |

Full docker-compose setup provided in Lesson 01.

Module Lessons

Lesson 01 - Structured Logging

The Python logging module internals, structlog pipeline configuration, correlation IDs via contextvars, JSON formatting, sensitive data masking, log aggregation with Loki, and async non-blocking log handlers. Transforms an unstructured service into one whose logs are instantly searchable.

Key deliverable: A logging_config.py module that any FastAPI service can drop in and immediately produce structured, correlated, JSON logs shipped to Loki.

Lesson 02 - Metrics with Prometheus

The Prometheus data model, all four metric types with real use cases, FastAPI auto-instrumentation, custom application metrics, PromQL for SRE work, Alertmanager rules, and a complete Grafana dashboard JSON.

Key deliverable: A metrics.py module with application-level metrics for a document processing service, 10 real PromQL queries, and 5 production alerting rules.

Lesson 03 - Distributed Tracing

OpenTelemetry Python SDK, auto-instrumentation for FastAPI / SQLAlchemy / Redis / HTTPX, custom spans for business logic, W3C trace context propagation, baggage, sampling strategies, and reading Jaeger waterfall diagrams.

Key deliverable: Full OpenTelemetry setup for a multi-service Python application with context propagation through HTTP, and trace IDs injected into log lines.

Lesson 04 - Error Tracking

Sentry Python SDK, enriching errors with user context and breadcrumbs, custom fingerprinting for error grouping, before_send hooks for sensitive data filtering, release tracking with source maps, and building an error triage workflow.

Key deliverable: A production Sentry configuration that groups errors intelligently, masks PII, and integrates with your release pipeline.

Lesson 05 - Health Checks and Readiness

Kubernetes liveness vs readiness vs startup probes, designing health checks that accurately reflect service health, parallel dependency checks with timeouts, SLOs and error budgets, synthetic monitoring, and health check anti-patterns.

Key deliverable: A complete /liveness, /readiness, and /startup implementation for a FastAPI service with PostgreSQL, Redis, and external API dependencies.

Observability Maturity Model

Before starting, assess where your service sits today:

| Level | Name | Characteristics |
|---|---|---|
| 0 | Dark | print() statements, no structure, errors discovered by users |
| 1 | Basic Logs | logging.basicConfig(), some log lines, unstructured text |
| 2 | Structured Logs | JSON logs with levels, timestamps, and some context fields |
| 3 | Correlated Logs | Request IDs in every log line, logs shipped to aggregator |
| 4 | Metrics | Prometheus counters/histograms, dashboards, basic alerts |
| 5 | Error Tracking | Sentry with user context, release tracking, error workflows |
| 6 | Tracing | Distributed traces, p99 from traces, traces linked to logs |
| 7 | Full Observability | SLOs, error budgets, synthetic monitoring, runbooks linked to alerts |

Many production Python services in the wild sit at Level 1 or 2. This module takes you to Level 7.

How to Work Through This Module

Each lesson follows the same structure:

  1. Opening incident - a real production failure caused by missing observability
  2. Concepts - the theory, explained through the lens of what the incident needed
  3. Working code - production-grade implementations, not toy examples
  4. Integration - how this pillar connects to the others
  5. Interview Q&A - five questions asked at senior/staff engineering interviews

Run each lesson's code examples locally. By the end of Lesson 03, you will have a fully instrumented Python service with logs, metrics, and traces all running in docker-compose, all visible in Grafana.

Quick Reference: The Golden Signals

Before diving into implementation, here are the four signals every production service must measure (from Google's SRE Book):

| Signal | What It Measures | Prometheus Metric Type |
|---|---|---|
| Latency | Time to serve a request (success vs error latency separately) | Histogram |
| Traffic | How much demand is hitting the system | Counter |
| Errors | Rate of failed requests (5xx, explicit failures, wrong results) | Counter |
| Saturation | How "full" the service is (CPU, memory, connection pools, queue depth) | Gauge |

These four signals, exposed correctly, will catch most production incidents before users notice them. Lessons 02 through 05 show you how to implement each one properly.
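As a rough illustration of what measuring the four signals means at the code level, here is an in-memory sketch using only the standard library. `GoldenSignals` and its field names are hypothetical, and in-flight requests are used as a simple proxy for saturation. A real service would export these values through a metrics library rather than hold them in a dataclass.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GoldenSignals:
    """In-memory stand-in for the four golden signals (illustrative only)."""

    requests_total: int = 0   # traffic (counter)
    errors_total: int = 0     # errors (counter)
    in_flight: int = 0        # saturation proxy (gauge)
    # Latency observations as (seconds, succeeded) so success and error
    # latency can be analysed separately.
    latencies: list[tuple[float, bool]] = field(default_factory=list)

    def record(self, handler: Callable[[], object]) -> object:
        """Run `handler` and update all four signals around the call."""
        self.requests_total += 1
        self.in_flight += 1
        started = time.perf_counter()
        succeeded = False
        try:
            result = handler()
            succeeded = True
            return result
        except Exception:
            self.errors_total += 1
            raise
        finally:
            self.in_flight -= 1
            self.latencies.append((time.perf_counter() - started, succeeded))


signals = GoldenSignals()
signals.record(lambda: "ok")
try:
    signals.record(lambda: 1 / 0)   # simulated failing request
except ZeroDivisionError:
    pass

print(signals.requests_total, signals.errors_total, signals.in_flight)  # 2 1 0
```

The counters only ever increase, while the gauge rises and falls with load. Those are the same semantics the metric types in the table above encode.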

Let's build observable systems.

© 2026 EngineersOfAI. All rights reserved.